In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
import pickle

from sklearn import preprocessing
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.metrics import *

le = preprocessing.LabelEncoder()
warnings.filterwarnings("ignore")

pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)

Introduction

Purpose and general description

This work was created with the goal of finding a way to predict adverse weather events. Given my lack of domain knowledge, I was asked to treat the problem as "agnostically" as possible: can mathematics and statistics alone find a pattern in the data?

It therefore becomes necessary to frame the task around a precise question with a clear objective, which I identified as follows: can the collected meteorological data anticipate a possible extreme event? At first glance this problem lends itself to an unsupervised analysis; however, not having sufficient knowledge to assess the quality of that alternative, I preferred to move the problem to the supervised side.

To do this, I created a label marking the days with a quantity of rain of at least 50 mm/day, the threshold indicated by the Swiss Meteorological Center as the "attention level" for the area the data cover (Domodossola).

Since the data are provided with a delay of one day ("today" I receive the data of "yesterday"), it is neither correct nor useful to classify days according to the occurrence of extreme events. Rather, I think it is useful to classify the day before the occurrence of the extreme event, to understand whether yesterday's data can "anticipate" a particularly adverse event.
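The labelling idea can be sketched on a toy series (invented rainfall values) to show how shifting the event flag back by one day produces the anticipatory target:

```python
import pandas as pd

# Toy daily rainfall series (invented values, in mm/day)
rain = pd.Series([10.0, 60.0, 5.0, 55.0, 0.0],
                 index=pd.date_range('2021-10-01', periods=5))

extreme = (rain >= 50).astype(int)   # 1 on the day the extreme event occurs
anticipator = extreme.shift(-1)      # 1 on the day *before* the event

print(pd.DataFrame({'rain': rain, 'extreme': extreme, 'anticipator': anticipator}))
```

The last day necessarily gets a NaN target, since the following day is not yet known.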

Technical environment

This work has been conducted as a "data science" POC, focused more on methodological aspects than on technical ones. This leaves unresolved many important issues that will have to be addressed in an eventual production phase of the work.

For the time being, it is worth noting that the libraries used for this POC are the standard Sklearn ones, which should make integration with the major cloud providers (AWS, Azure, GCP) easier. In addition, the training phase was built on the Sklearn Pipeline, in order to get as close as possible to MLOps paradigms (via mlflow, for example). Further information on the versions and libraries actually used can be provided on request.

Use case

This is the intended use case:

1) Data collected from IoT sensors and drones arrives, covering the previous day.

2) The classifier is launched to answer the following question: was yesterday one of those days "anticipating" an extreme event?

3) If yes, a weather alert is activated!
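The three steps above could be wired together roughly as follows. This is only a sketch: the pickle filename matches the one saved at the end of this notebook, while `daily_alert_check` and the alerting print are hypothetical placeholders.

```python
import pickle
import numpy as np

def daily_alert_check(model, yesterday_features):
    """Classify yesterday's data; 1 means it looks like a day preceding an extreme event."""
    pred = model.predict(np.asarray(yesterday_features, dtype=float).reshape(1, -1))
    if pred[0] == 1:
        print('Weather alert activated!')  # placeholder for a real notification channel
    return pred[0]

# Typical usage once the pipeline saved at the end of this notebook is available:
# with open('meteo_adversity_anticipator.pkl', 'rb') as f:
#     model = pickle.load(f)
# daily_alert_check(model, yesterday_features)
```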

Next steps

Future improvements:

1) Use a smaller time granularity (hours, minutes, etc.).

2) Improve the pipeline to capture more detailed aspects (with custom functions).

2.1) Oversample the data with synthetic data generators.

2.2) Fill empty columns with the EM algorithm.

2.3) Rescale features through an inverse-beta distribution (more useful when the data show high skewness).

3) Use time series modelling that can keep track of any seasonality in the data.

4) Implement adversarial validation to check for distribution shift between train and test set.

5) Run in real time to make predictions for the next 24 hours from any interval.
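As an illustration of point 2, a custom function can be wrapped as a step of the Sklearn Pipeline via FunctionTransformer. The `log1p` rescaling below is only a hypothetical example of such a custom step, not something used elsewhere in this POC:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Hypothetical custom step: log1p to tame right-skewed features
custom_pipeline = Pipeline(steps=[
    ('log', FunctionTransformer(np.log1p)),  # custom function wrapped as a pipeline step
    ('scaler', MinMaxScaler()),              # same rescaling used later in this POC
])

X_demo = np.array([[0.0, 1.0], [10.0, 100.0]])
print(custom_pipeline.fit_transform(X_demo))
```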

1.0 Data retrieval and ETL

In [2]:
df = pd.read_excel('0_dataset/DOMODOSSOLA_giornalieri_1988_2021.xlsx')
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12110 entries, 0 to 12109
Data columns (total 15 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   Data                                  12110 non-null  datetime64[ns]
 1   Precipitazione dalle 9 alle 9 ( mm )  11878 non-null  object        
 2   Precipitazione dalle 0 alle 0 ( mm )  12023 non-null  object        
 3   Temperatura media ( °C )              12084 non-null  object        
 4   Temperatura massima ( °C )            11893 non-null  object        
 5   Temperatura minima ( °C )             11884 non-null  object        
 6   Umidita' media ( % )                  11489 non-null  float64       
 7   Umidita' massima ( % )                11291 non-null  float64       
 8   Umidita' minima ( % )                 11291 non-null  float64       
 9   Velocita' media ( m/s )               11067 non-null  object        
 10  Raffica ( m/s )                       10860 non-null  object        
 11  Durata Calma ( min )                  11067 non-null  float64       
 12  Settore Prevalente                    7157 non-null   object        
 13  Tempo Permanenza ( min )              7155 non-null   float64       
 14  Radiazione totale ( MJ/mq )           11238 non-null  object        
dtypes: datetime64[ns](1), float64(5), object(9)
memory usage: 1.4+ MB

Set the "Data" column as index and correct the formatting of some fields (from object to float).

In [4]:
df = df.set_index('Data')
In [5]:
for col in df:
    if df[col].dtypes == 'object' and col != 'Settore Prevalente':
        # commas are used as decimal separators in the source file
        df[col] = df[col].str.replace(',', '.').astype(float)

Encode the only categorical variable into a numerical version, preserving the rows that were originally NaN.

In [6]:
df['Settore Prevalente'] = df['Settore Prevalente'].astype(str)
df['Settore Prevalente'] = le.fit_transform(df['Settore Prevalente'])
# restore the original NaN's: the string 'nan' was encoded as a class of its own
nan_code = le.transform(['nan'])[0]
df['Settore Prevalente'] = df['Settore Prevalente'].mask(df['Settore Prevalente']==nan_code,np.nan)

Let's have a look at the pairwise relationships in the data.

In [7]:
sns.pairplot(df)
Out[7]:
<seaborn.axisgrid.PairGrid at 0x1b26bed3f60>

2.0 Data labelling

As written in the introduction, two new labels were created in order to answer the question posed at the beginning: are yesterday's data anticipatory of an extreme climate event?

To do this the following steps are necessary:

1) creation of the label marking the "extreme" event (based on the mean of the two rain precipitation measurements, taken over the 9-to-9 and 0-to-0 windows);

2) creation of the label corresponding to the day before the occurrence of the extreme event (the target of the classifier).

In [8]:
df['precipitazione_media'] = df[['Precipitazione dalle 9 alle 9 ( mm )','Precipitazione dalle 0 alle 0 ( mm )']].mean(axis=1)
In [9]:
df['evento_estremo'] = 0
df['evento_estremo'] = df['evento_estremo'].mask(df['precipitazione_media']>=50,1)
In [10]:
df['anticipatore'] = 0
df['anticipatore'] = df['anticipatore'].mask(df['evento_estremo']==1,1).shift(-1)
df['anticipatore'].value_counts(normalize=False)
Out[10]:
0.0    11928
1.0      181
Name: anticipatore, dtype: int64

3.0 Data preparation

3.1 Class imbalance

Given the large imbalance between the classes, as shown above, the problem must be treated as either:

  • anomaly detection, or
  • binary classification after suitable oversampling or undersampling

Anomaly detection, being an alternative close to the unsupervised world, is not considered for the moment. I started with the binary classifier and undersampled the majority class: in this way I avoid the bias that could arise from oversampling the minority class (through the creation of data that is, perhaps, too artificial).

In [11]:
df_under = df[df['anticipatore']==0]
df_over = df[df['anticipatore']==1]
df_under = df_under.sample(frac=df_over.shape[0]/df_under.shape[0],random_state=0)
df = pd.concat([df_over, df_under]).sample(frac=1)  # DataFrame.append was removed in pandas 2.0

3.2 Edit column format

In [12]:
df['Settore Prevalente'] = df['Settore Prevalente'].astype(str)

3.3 Data partitioning

In [13]:
print('Dimension of the dataset: ',df.shape)
Dimension of the dataset:  (362, 17)

Since we are dealing with a relatively small dataset, it is not advisable to create a "classic" train, dev and test split.

For this reason, only a split into train and test set is implemented. Then, in order to manage the tradeoff between bias and overfitting, the train set undergoes a stratified cross-validation procedure together with a search for the best hyperparameters via GridSearch.

In order to avoid data leakage during cross-validation, it is necessary to implement a pipeline that performs the following steps, iteratively, on each fold created by the cross-validation procedure:

1) Filling the NaN holes in the columns via KNN imputation (holes are filled based on the rows closest to the row under consideration).

2) Re-scaling of the variables

3) The classifier (a classic gradient boosting).

4.0 Modeling

4.1 Training

In [14]:
X = df.drop(columns=['anticipatore','evento_estremo']).values
y = df['anticipatore'].values
In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.15 , random_state = 1)

Build the Pipeline

In [16]:
steps = list()
steps.append(('imputation',KNNImputer(n_neighbors=5,weights='distance')))
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', GradientBoostingClassifier(random_state=1)))
pipeline = Pipeline(steps=steps)

Parameters of the GridSearch

In [17]:
parameters = {
    'imputation__n_neighbors':[3,5,10],
    'model__learning_rate':[0.05,0.1,0.5]
}

Train the model with stratified cross-validation and GridSearch. The chosen metric is accuracy, since in this first phase the difference between the "cost" of false positives and false negatives does not matter. Otherwise it would have been necessary to choose between Precision and Recall (or their harmonic mean, the F1 score).

In [18]:
k = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)
model = GridSearchCV(pipeline,param_grid=parameters,cv=k,scoring='accuracy').fit(X_train, y_train)

Output of the cross-validation

In [19]:
print("Best Estimator: \n{}\n".format(model.best_estimator_))
print("Best Parameters: \n{}\n".format(model.best_params_))
print("Best Test Score: \n{}\n".format(model.best_score_))
print("Mean Test Scores: \n{}\n".format(model.cv_results_['mean_test_score']))
Best Estimator: 
Pipeline(steps=[('imputation', KNNImputer(n_neighbors=3, weights='distance')),
                ('scaler', MinMaxScaler()),
                ('model',
                 GradientBoostingClassifier(learning_rate=0.5,
                                            random_state=1))])

Best Parameters: 
{'imputation__n_neighbors': 3, 'model__learning_rate': 0.5}

Best Test Score: 
0.9033462360132835

Mean Test Scores: 
[0.90008884 0.89791019 0.90334624 0.88921674 0.89248472 0.89031664
 0.8816655  0.88815915 0.8914377 ]

The results of this first test are promising! Before running the model on the test set, it is necessary to verify that there is no "distribution shift", which occurs when the data distribution of the train set is very different from that of the dev or test set. To verify this, three alternatives are possible:

1) classical statistical tests (a Bayesian t-test, for example)

2) adversarial validation

3) histogram comparison

For this POC, the last option has been implemented as a sort of baseline. If the average result of the next function is greater than or around 0.5, we can assume that the difference between the distributions of the train and test set is marginal.

In [20]:
def histogram_intersection(hist_1, hist_2):
    minima = np.minimum(hist_1, hist_2)
    intersection = np.true_divide(np.sum(minima), np.sum(hist_2))
    return intersection
In [21]:
h1 = pd.DataFrame(X_train)
h2 = pd.DataFrame(X_test)
In [22]:
h1 = h1.replace('nan',np.nan)
h2 = h2.replace('nan',np.nan)
h1 = h1.fillna(0.0)
h2 = h2.fillna(0.0)
In [23]:
inters = []
for col in h1.columns:
    h1[col] = h1[col].astype(float)
    h2[col] = h2[col].astype(float)
    inters.append(histogram_intersection(h1[col], h2[col]))
print('Mean value of the intersection of all the columns is: ',np.mean(inters))
Mean value of the intersection of all the columns is:  0.5497313300551955

Given the small dataset we're facing, let's assume train and test do not present "too much" distribution shift.
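For comparison, the adversarial-validation alternative mentioned above could be sketched as follows: train a classifier to distinguish train rows from test rows; a cross-validated AUC near 0.5 means the two sets are hard to tell apart, i.e. no strong shift. This is only an illustrative sketch on synthetic data, not part of the POC pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def adversarial_auc(X_train, X_test):
    """AUC of a classifier that tries to separate train rows from test rows."""
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])  # 0 = train, 1 = test
    clf = GradientBoostingClassifier(random_state=1)
    return cross_val_score(clf, X, y, cv=3, scoring='roc_auc').mean()

# Synthetic demo: both samples come from the same distribution,
# so the AUC should hover around 0.5 (no detectable shift)
rng = np.random.default_rng(0)
auc = adversarial_auc(rng.normal(size=(200, 5)), rng.normal(size=(100, 5)))
print('Adversarial AUC:', auc)
```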

4.2 Testing

In [24]:
y_test_predict_grid = model.predict(X_test)
print("Model Test Recall:", metrics.recall_score(y_test, y_test_predict_grid))
print('--------------------------------------------------')
print("Model Test Precision:", metrics.precision_score(y_test, y_test_predict_grid))
print('--------------------------------------------------')
print("Model Test F1:", metrics.f1_score(y_test, y_test_predict_grid))
print('--------------------------------------------------')
print("Model Test Accuracy:", metrics.accuracy_score(y_test, y_test_predict_grid))
print('--------------------------------------------------')
print('Model Confusion Matrix')
cm = confusion_matrix(y_test,y_test_predict_grid,normalize=None) 
cmd = ConfusionMatrixDisplay(cm,display_labels=['No Anticipatore','Sì Anticipatore']).plot()
Model Test Recall: 0.9629629629629629
--------------------------------------------------
Model Test Precision: 0.9629629629629629
--------------------------------------------------
Model Test F1: 0.9629629629629629
--------------------------------------------------
Model Test Accuracy: 0.9636363636363636
--------------------------------------------------
Model Confusion Matrix

On the test set the positive result of the training phase is confirmed! The model pipeline can now be saved in order to reuse it every time new data arrives, in the same format as in this POC.

5.0 Output

The output of this POC is a trained classifier that can be reused to make further predictions according to the provided data structure.

In [26]:
with open('meteo_adversity_anticipator.pkl','wb') as file:
    pickle.dump(model, file)